Conversation

casteryh (Contributor) opened this pull request.

No description provided.

@meta-cla bot added the "CLA Signed" label on Oct 13, 2025. (This label is managed by the Meta Open Source bot.)
@LucasLLC (Contributor) left a comment:

I'm open to changing the transport buffer interface, but the burden of this PR is to:

  • ensure resharding tests don't break (and that we don't drop support for contiguous tensors)
  • prove this is better
  • prove that this is faster

Arguably the interface is too biased towards rdma buffer, but any new interface needs to be generically supported against all backends (incl gloo in flight).

If your goal is to make rdma buffer faster, can you use test_models.py and test whether this is faster?

Inline review on the changed call site (old call first, new call below; the new line is truncated in the diff excerpt):

    await self.storage_volume.get.call_one(
        key, transport_buffer, request.meta_only()
    )
    transport_buffer = await self.storage_volume.get.call_one(

Reviewer (Contributor):

You're creating a race condition here -- memory is often created on the fly in storage volume to deal with non-contiguous tensors.

casteryh (Author) replied:

In storage volume, all the tensors are already contiguous, and it's just handing out RDMABuffers pointing to those tensors.
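To illustrate that claim (a hedged sketch, not the actual torchstore code; `StorageVolume` and `RemoteRef` are hypothetical stand-ins, with numpy standing in for torch):

```python
import numpy as np

class RemoteRef:
    """Hypothetical stand-in for an RDMABuffer: a non-owning reference
    to already-contiguous memory (address + length only)."""
    def __init__(self, array: np.ndarray):
        assert array.flags["C_CONTIGUOUS"], "RDMA-style refs need contiguous memory"
        self.addr = array.ctypes.data   # points at the existing storage, no copy
        self.nbytes = array.nbytes

class StorageVolume:
    """Hypothetical storage volume: tensors are made contiguous once at
    put() time, so get() can hand out refs without allocating anything."""
    def __init__(self):
        self._store = {}

    def put(self, key, tensor):
        # The only copy (if any) happens here, at ingest time.
        self._store[key] = np.ascontiguousarray(tensor)

    def get(self, key):
        # Read path allocates nothing: the ref aliases stored memory.
        return RemoteRef(self._store[key])

vol = StorageVolume()
strided = np.arange(12, dtype=np.float32).reshape(3, 4)[:, ::2]  # non-contiguous
vol.put("w", strided)
ref = vol.get("w")
```

Under this model the read path never creates memory on the fly, which is the premise behind saying there is no race on `get`.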

@codecov-commenter commented:

Codecov Report

❌ Patch coverage is 70.96774% with 9 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@871fa11). Learn more about missing BASE report.

Files with missing lines          Patch %   Lines
torchstore/transport/buffers.py   58.82%    7 Missing ⚠️
torchstore/storage_volume.py      88.88%    1 Missing ⚠️
torchstore/transport/pipe.py      80.00%    1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main      #57   +/-   ##
=======================================
  Coverage        ?   61.01%           
=======================================
  Files           ?       22           
  Lines           ?     1698           
  Branches        ?        0           
=======================================
  Hits            ?     1036           
  Misses          ?      662           
  Partials        ?        0           

☔ View full report in Codecov by Sentry.

@casteryh (Author) commented Oct 13, 2025:

> I'm open to changing the transport buffer interface, but the burden of this PR is to:
>   • ensure resharding tests don't break (and that we don't drop support for contiguous tensors)

The test itself seemed broken for me (it either hangs or is extremely slow; I had been waiting for 10 minutes).
Update: it did pass, it was indeed just slow, taking 20 minutes to complete.
Ran the integration tests in forge and they passed for Qwen 8B, trainer fsdp=2, policy tp=2.

>   • prove this is better

I think TransportBuffer shouldn't need to allocate anything and shouldn't own anything. It's easier to reason about if we simply treat it as a "remote reference" to a tensor of sorts.
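A minimal sketch of that framing (hypothetical names, not the actual torchstore interface): a transport buffer that only describes existing tensor memory, while ownership stays wherever the tensor lives.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TensorRef:
    """Hypothetical 'remote reference' transport buffer: it records where
    the bytes live plus enough metadata to reinterpret them on the far
    side, but it allocates nothing and owns nothing."""
    addr: int      # address of the already-contiguous storage
    nbytes: int    # length of the region in bytes
    dtype: str     # e.g. "float32"
    shape: tuple   # logical shape, for reconstruction remotely

def ref_from_contiguous(addr, nbytes, dtype, shape):
    # In the spirit of from_contiguous_tensor: the caller guarantees the
    # memory is contiguous; the ref merely captures metadata about it.
    return TensorRef(addr, nbytes, dtype, tuple(shape))

# Example: a 16x16 float32 tensor at some (made-up) address.
ref = ref_from_contiguous(0x7F0000010000, 16 * 16 * 4, "float32", (16, 16))
```

Because the ref is immutable and owns no storage, its lifetime is trivially decoupled from the tensor's, which is what makes the design easy to reason about.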

>   • prove that this is faster

This will need more evidence, but see below for results with test_models.py.

> Arguably the interface is too biased towards rdma buffer, but any new interface needs to be generically supported against all backends (incl gloo in flight).

I agree. For example, with gloo we can probably make non-contiguous tensors work without extra allocation. I think we can always change the interface later to add something less restrictive than from_contiguous_tensor.
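To make that concrete (a hedged numpy sketch, not gloo's or torchstore's actual code; `contiguous_runs` is a hypothetical helper): a message-passing backend can walk a strided tensor and transmit its contiguous runs one by one, so at most small per-run copies are needed, never a whole-tensor staging buffer.

```python
import numpy as np

def contiguous_runs(a):
    """Yield the contiguous pieces of a possibly-strided array so a
    message-passing backend can send them individually, without ever
    materializing a contiguous copy of the whole tensor."""
    if a.flags["C_CONTIGUOUS"]:
        yield a.reshape(-1)        # a view, not a copy
    elif a.ndim <= 1:
        # Innermost strided run: a small per-run copy, never whole-tensor.
        yield np.ascontiguousarray(a)
    else:
        for sub in a:              # recurse over the outermost axis
            yield from contiguous_runs(sub)

x = np.arange(12, dtype=np.float32).reshape(3, 4)[:, ::2]  # non-contiguous view
runs = list(contiguous_runs(x))
```

Concatenating the runs on the receiving side reproduces the tensor's logical contents, which is all a copy-based transport like gloo needs.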

> If your goal is to make rdma buffer faster, can you use test_models.py and test whether this is faster?

Yes, on a single Slurm node: put went from 7 seconds to 4.3 seconds. I will run 32B in forge to double-check.
before: https://www.internalfb.com/phabricator/paste/view/P1991001240

[0] rank: 0 pushed state dict in 7.046610719058663 seconds
[0] rank: 0 got state dict in 4.545355102978647 seconds

after: https://www.internalfb.com/phabricator/paste/view/P1990993018

[0] rank: 0 pushed state dict in 4.3794964698608965 seconds
[0] rank: 0 got state dict in 4.107833093032241 seconds

@casteryh (Author):

From a forge e2e run on Slurm (32B, multi-node):
without patch:

STEP 1
  policy_worker_perf/update_weights/total_duration_avg_s: 78.15236387704499
  policy_worker_perf/update_weights/total_duration_max_s: 83.50384995015338
  rl_trainer_perf/push_weights/total_duration_avg_s: 9.24261197517626
  rl_trainer_perf/push_weights/total_duration_max_s: 10.174692547880113
STEP 2
  policy_worker_perf/update_weights/total_duration_avg_s: 76.40055759515963
  policy_worker_perf/update_weights/total_duration_max_s: 82.62615168001503
  rl_trainer_perf/push_weights/total_duration_avg_s: 7.3707064733607695
  rl_trainer_perf/push_weights/total_duration_max_s: 8.557962979190052
STEP 3
  policy_worker_perf/update_weights/total_duration_avg_s: 76.61371284359484
  policy_worker_perf/update_weights/total_duration_max_s: 82.17285228613764
  rl_trainer_perf/push_weights/total_duration_avg_s: 7.345048399467487
  rl_trainer_perf/push_weights/total_duration_max_s: 8.0563137922436
STEP 4
  policy_worker_perf/update_weights/total_duration_avg_s: 76.53273953814642
  policy_worker_perf/update_weights/total_duration_max_s: 81.62242694292217
  rl_trainer_perf/push_weights/total_duration_avg_s: 7.770320081850514
  rl_trainer_perf/push_weights/total_duration_max_s: 8.817566874902695

with patch:

STEP 1
  policy_worker_perf/update_weights/total_duration_avg_s: 62.47022197701153
  policy_worker_perf/update_weights/total_duration_max_s: 69.53650338202715
  rl_trainer_perf/push_weights/total_duration_avg_s: 9.41419801331358
  rl_trainer_perf/push_weights/total_duration_max_s: 10.834439367055893
STEP 2
  policy_worker_perf/update_weights/total_duration_avg_s: 61.52622404671274
  policy_worker_perf/update_weights/total_duration_max_s: 70.85746217798442
  rl_trainer_perf/push_weights/total_duration_avg_s: 8.029894174134824
  rl_trainer_perf/push_weights/total_duration_max_s: 9.306584176141769
STEP 3
  policy_worker_perf/update_weights/total_duration_avg_s: 62.209209970140364
  policy_worker_perf/update_weights/total_duration_max_s: 72.19040106609464
  rl_trainer_perf/push_weights/total_duration_avg_s: 7.521832116821315
  rl_trainer_perf/push_weights/total_duration_max_s: 8.884237607009709
STEP 4
  policy_worker_perf/update_weights/total_duration_avg_s: 62.030980555253336
  policy_worker_perf/update_weights/total_duration_max_s: 72.34240108402446
  rl_trainer_perf/push_weights/total_duration_avg_s: 7.057374230294954
  rl_trainer_perf/push_weights/total_duration_max_s: 7.307658496778458

@casteryh (Author):

ptal @LucasLLC
